
Differentiable Top-k with Optimal Transport

Neural Information Processing Systems

Finding the k largest or smallest elements from a collection of scores, i.e., the top-k operation, is an important model component widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using the bubble algorithm, the resulting model cannot be trained end-to-end using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether each element belongs to the top-k set is essentially discontinuous. To address this issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We then apply the proposed operator to the k-nearest neighbors and beam search algorithms, and numerical experiments demonstrate improved performance.
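The EOT idea can be sketched in a few lines of NumPy: transport the n scores onto two anchors (here the max and min score, an assumption of this sketch rather than the paper's exact construction), assigning k/n of the target mass to the "top" anchor. A few Sinkhorn iterations solve the entropic OT problem, and the plan's first column gives smooth top-k membership indicators. The paper obtains gradients from the EOT optimality conditions; this forward-pass sketch omits that part.

```python
import numpy as np

def soft_top_k(scores, k, eps=0.1, n_iter=200):
    """Soft top-k indicators via an entropic-OT relaxation (Sinkhorn).

    A didactic sketch: scores are transported onto two anchors (max and
    min score); the mass each score sends to the 'max' anchor acts as a
    smooth indicator of top-k membership. Not the authors' exact code.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    anchors = np.array([scores.max(), scores.min()])
    # squared-distance cost between each score and the two anchors
    C = (scores[:, None] - anchors[None, :]) ** 2
    # marginals: uniform over scores; k/n of the mass on the 'top' anchor
    mu = np.full(n, 1.0 / n)
    nu = np.array([k / n, (n - k) / n])
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):      # standard Sinkhorn scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]   # entropic transport plan
    return n * P[:, 0]                # soft indicator of top-k membership

ind = soft_top_k(np.array([0.3, 2.0, -1.0, 1.5, 0.1]), k=2, eps=0.05)
# the two largest scores (indices 1 and 3) receive indicators near 1
```

As eps shrinks, the indicators approach the hard 0/1 top-k vector; larger eps gives a smoother (more trainable) but less exact operator.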








Differentiable Top-k Operator with Optimal Transport

Xie, Yujia, Dai, Hanjun, Chen, Minshuo, Dai, Bo, Zhao, Tuo, Zha, Hongyuan, Wei, Wei, Pfister, Tomas

arXiv.org Machine Learning

The top-k operation, i.e., finding the k largest or smallest elements from a collection of scores, is an important model component, which is widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using the bubble algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether each element belongs to the top-k set is essentially discontinuous. To address the issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and demonstrate improved performance.


Sparse Weight Activation Training

Raihan, Md Aamir, Aamodt, Tor M.

arXiv.org Machine Learning

They have an indexing unit for enabling sparse multiplication. The computations are spatially mapped and scheduled to these processing elements (PEs) by control and scheduling logic. Each PE generates partial products, which are accumulated to compute the output values and finally stored in DRAM.

Mapping Computations: Consider a convolutional layer, which maps the input activations in ∈ R^(N×C×H_I×W_I) to out ∈ R^(N×F×H_O×W_O). The layer computes F channels of output feature maps, each of dimension H_O×W_O, using C channels of input feature maps of dimension H_I×W_I, for each of the N samples in the mini-batch. The layer has parameters w ∈ R^(F×C×H_K×W_K).

Algorithm 1: Dense forward-pass computation for a single input sample (assuming stride 1)
Data: w, in
Result: out
for h_o = 1 to H_O do
  for w_o = 1 to W_O do
    for f = 1 to F do
      for c = 1 to C do
        for h_k = 1 to H_K do
          for w_k = 1 to W_K do
            h = h_o + h_k; w' = w_o + w_k
            out[f][h_o][w_o] += w[f][c][h_k][w_k] * in[c][h][w']

Thus, as shown in Algorithm 1, each activation is reused F · H_K · W_K times, each weight is reused N · H_O · W_O times, and the total computation per sample is:

Dense Convolution FLOPs = F · H_O · W_O · C · H_K · W_K    (7)

The first three 'for' loops are independent and can be mapped independently to the PEs, whereas the inner three 'for' loops generate the partial products. Different sparse accelerators map the 'for' loops spatially over the PEs in different ways, to maximize reuse and minimize data transfer to and from DRAM.
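The dense forward pass of Algorithm 1 can be checked with a short, didactic Python implementation (a sketch for a single sample, stride 1, "valid" padding; the function name and MAC counter are ours). Counting the multiply-accumulates directly verifies the FLOP formula in equation (7):

```python
import numpy as np

def dense_conv2d(inp, w):
    """Direct six-loop convolution as in Algorithm 1 (stride 1, one sample).

    inp: (C, H_I, W_I) input activations; w: (F, C, H_K, W_K) filters.
    Returns (out, macs), where macs counts multiply-accumulates and
    should equal F * H_O * W_O * C * H_K * W_K, matching equation (7).
    """
    C, H_I, W_I = inp.shape
    F, Cw, H_K, W_K = w.shape
    assert C == Cw
    H_O, W_O = H_I - H_K + 1, W_I - W_K + 1
    out = np.zeros((F, H_O, W_O))
    macs = 0
    # The first three loops are independent (mappable across PEs);
    # the inner three loops accumulate the partial products.
    for h_o in range(H_O):
        for w_o in range(W_O):
            for f in range(F):
                for c in range(C):
                    for h_k in range(H_K):
                        for w_k in range(W_K):
                            out[f, h_o, w_o] += (
                                w[f, c, h_k, w_k] * inp[c, h_o + h_k, w_o + w_k]
                            )
                            macs += 1
    return out, macs
```

A vectorized cross-check (e.g., `np.einsum` over sliding windows) produces the same output, while the MAC count equals F·H_O·W_O·C·H_K·W_K exactly.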